Paper: The phylogenetic range of bacterial and viral pathogens of vertebrates preprint: https://doi.org/10.1101/670315
Authors: Liam P. Shaw*, Alethea Wang*, David Dylus, Magda Meier, Grega Pogacnik, Christophe Dessimoz, Francois Balloux (* co-first authors)
This is a supplementary R Markdown workbook (Supplementary Text 1) containing code to reproduce all main figures and analyses in the associated paper, as well as several supplementary files. Code can be shown or hidden overall (see top right) or for individual sections.
Code can be altered to change any of the figures. Some explanatory text is provided, but not much. Any questions can be addressed via email (liam.philip.shaw@gmail.com).
Unless otherwise stated, data related to viruses is plotted in red and bacteria in black. The code should take <10 minutes to run on a standard laptop.
Note: At several points code is adapted/reused from the supplementary code repository made available by Olival et. al. (2017):
This mainly applies to several scripts in the `scripts’ directory relating to the fitting and plotting of generalized additive models (GAMs) – more information is included where appropriate. All code here is also made available under an MIT License.
This section contains library requirements and defines two useful functions:
First, the data is read in: the database of host-pathogen associations (tidy format i.e. one association per row); and the host phylogenetic tree based on mitochondrial gene alignments.
# Host-pathogen association database
pathogen_vs_host_db <- read.csv('data/PathogenVsHostDB-2019-05-30.csv', stringsAsFactors = F)
# Host phylogenetic tree
host_tree <- read.tree('data/HostPhylogeneticTree-2019-05-30.nwk')
# Uncomment to use only species which do not disrupt monophyly (n=40)
#host_tree <- read.tree('HostPhylogeneticTree-2019-05-30-pruned.nwk')
# Cophenetic matrix (from phylogenetic tree)
cophenetic_matrix <- cophenetic.phylo(host_tree)
# Keep only overlap
host_tree <- drop.tip(host_tree, host_tree$tip.label[which(!host_tree$tip.label %in% pathogen_vs_host_db$HostSpeciesPHB)])
pathogen_vs_host_db <- pathogen_vs_host_db[which(pathogen_vs_host_db$HostSpeciesPHB %in% host_tree$tip.label),]
# Subset database into virus and bacteria
virus <- pathogen_vs_host_db[which(pathogen_vs_host_db$Type=="Virus"),]
bacteria <- pathogen_vs_host_db[which(pathogen_vs_host_db$Type=="Bacteria"),]
To begin with, some overall statistics which are cited in the main text.
# List of host orders
host_orders <- unique(as.character(pathogen_vs_host_db$HostOrder))
n.host.orders <- length(host_orders)
# Calculate THR for each host order
order_thr <- sapply(host_orders, function(x) THR(x, FUN=max))
median.inter.host.distance.order <- median(order_thr)
The number of host orders is 90.
The medium maximum inter-host distance within an order is 0.323.
We find that viruses are significantly more likely to be vector-borne compared with bacteria.
vector.tab <- table(pathogen_vs_host_db[which(!duplicated(pathogen_vs_host_db$Species)),"VectorBorne"], pathogen_vs_host_db[which(!duplicated(pathogen_vs_host_db$Species)),"Type"])
vector.tab <- vector.tab[c("No", "Yes"),]
chisq.test(vector.tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: vector.tab
## X-squared = 91.529, df = 1, p-value < 2.2e-16
We want to think about pathogen range etc. after removing them from the database and the host tree. We recalculate PHB after doing this.
# Removing humans from database
pathogen_vs_host_db_no_humans <- pathogen_vs_host_db[which(pathogen_vs_host_db$HostSpeciesPHB!="Homosapiens"),]
t.no.humans <- drop.tip(host_tree, tip = host_tree$tip.label[!host_tree$tip.label %in% pathogen_vs_host_db_no_humans$HostSpeciesPHB])
d.no.humans <- cophenetic.phylo(t.no.humans)
# Bacteria PHB
bacteria.names <- names(sort(table(bacteria$Species), decreasing=TRUE))
bacteria.no.humans.phbs <- as.numeric(sapply(bacteria.names, function(x) PHB(x,
m=pathogen_vs_host_db_no_humans,
cophenetic.matrix = d.no.humans)))
virus.names <- names(sort(table(virus$Species), decreasing=TRUE))
virus.no.humans.phbs <- as.numeric(sapply(virus.names, function(x) PHB(x, m=pathogen_vs_host_db_no_humans,
cophenetic.matrix = d.no.humans)))
virus.no.humans.phbs.range <- as.numeric(sapply(virus.names, function(x) PHB(x, m=pathogen_vs_host_db_no_humans,
cophenetic.matrix = d.no.humans, FUN = max)))
It is useful to have data frames with the summary statistics for each pathogen species (i.e. one row per pathogen), calculated from the pathogen-host association database (one row per association). We separate this into viruses and bacteria, as we treat them separately.
options(warn=-1) # Turn off warnings
virus.names <- names(sort(table(virus$Species), decreasing=TRUE))
n.hosts <- as.numeric(sort(table(virus$Species), decreasing=TRUE))
virus.phbs <- sapply(virus.names, function(x) PHB(x))
virus.unique <- virus[which(!duplicated(virus$Species)),]
rownames(virus.unique) <- virus.unique$Species
proteins <- as.numeric(virus.unique[virus.names, "ProteinCount"])
virus.families <- virus.unique[virus.names, "Family"]
# Combine into a data frame of statistics for each viral pathogen
virus.df <- data.frame(cbind(virus.phbs, proteins))
genome.size <- as.numeric(virus.unique[virus.names, "Gsize"])
virus.df$genome.size <- genome.size
virus.df$family <- virus.families
#
n.hosts <- as.numeric(sort(table(virus$Species), decreasing=TRUE))
virus.df$n.hosts <- n.hosts
genome.type <- virus.unique[virus.names, "Genome"]
virus.df$genome.type <- genome.type # Genome type
# Genome type (just RNA/DNA)
genome.type.rna.dna <- ifelse(genome.type %in% c("(-) ssRNA", "(+) ssRNA",
"ssRNA-RT", "dsRNA"), "RNA",
ifelse(genome.type %in% c("dsDNA", "dsDNA-RT", "ssDNA"),
"DNA", ""))
virus.df$genome.type.rna.dna <- genome.type.rna.dna
zoonotic <- ifelse(virus.unique[virus.names, "Zoonotic"]=="Yes", "Zoonotic", "Not zoonotic")
virus.df$n.hosts <- n.hosts # Number of hosts
virus.df$zoonotic <- zoonotic # Zoonotic
virus.df$virus.phbs.no.human <- virus.no.humans.phbs # Add non-human mean PHB
virus.df$virus.phbs.max.no.human <- virus.no.humans.phbs.range # Add non-human maximum PHB
virus.df$vector.borne <- virus.unique[virus.names, "VectorBorne"]
# Get bacterial pathogen species names
bacteria.names <- names(sort(table(bacteria$Species), decreasing=TRUE))
# Get number of hosts for each pathogen
n.hosts <- as.numeric(sort(table(bacteria$Species), decreasing=TRUE))
# Get mean PHB for each pathogen
bacteria.phbs <- as.numeric(sapply(bacteria.names, function(x) PHB(x)))
# Dataframe of unique bacterial pathogen species
bacteria.unique <- bacteria[which(!duplicated(bacteria$Species)),]
rownames(bacteria.unique) <- bacteria.unique$Species
# Extract relevant lifestyle variables:
# motility, cell, Gram stain,
# spore formation, oxygen requirements, zoonotic
motility <- bacteria.unique[bacteria.names, "Motility"]
cell <- bacteria.unique[bacteria.names, "Cell"]
gram <- bacteria.unique[bacteria.names, "GramStain"]
spore <- bacteria.unique[bacteria.names, "Spore"]
oxygen <- bacteria.unique[bacteria.names, "Oxygen"]
zoonotic <- ifelse(bacteria.unique[bacteria.names, "Zoonotic"]=="Yes", "Zoonotic", "Not zoonotic")
# Combine this information into a single data frame of statistics for each bacterial pathogen
bacteria.df <- data.frame(cbind(bacteria.names, n.hosts, bacteria.phbs,
motility, gram, spore, oxygen, zoonotic, cell))
bacteria.df$bacteria.phbs <- as.numeric(as.character(bacteria.df$bacteria.phbs)) # Make sure numeric variable
bacteria.df$bacteria.phbs.no.human <- bacteria.no.humans.phbs # Add in PHB without humans
bacteria.df$family <- bacteria.unique[bacteria.names, "Family"] # Add taxonomic family
bacteria.df$genus <- bacteria.unique[bacteria.names, "Genus"] # Add taxonomic genus
bacteria.df$vector.borne <- bacteria.unique[bacteria.names, "VectorBorne"]
This is the schematic overview figure, and was made manually in Inkscape using images from FlatIcon (see references in main manuscript).
We get a dataset of just the unique pathogens and then calculate the mean phylogenetic host breadth (PHB) of each pathogen. We then use this to plot a histogram.
options(warn=-1) # Turn off warnings
# Get unique pathogens
unique_pathogens <- data.frame(Species=pathogen_vs_host_db[!duplicated(pathogen_vs_host_db$Species),"Species"],
Type=pathogen_vs_host_db[!duplicated(pathogen_vs_host_db$Species),"Type"])
unique_pathogens$Species <- as.character(unique_pathogens$Species)
# Calculate PHB
unique_pathogens$PHB <- sapply(unique_pathogens$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_matrix, m = pathogen_vs_host_db))
unique_pathogens$PHB.median <- sapply(unique_pathogens$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_matrix, m = pathogen_vs_host_db, FUN=median))
unique_pathogens$PHB.max <- sapply(unique_pathogens$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_matrix, m = pathogen_vs_host_db, FUN = max))
# Number of total hosts
host_species_counts <- data.frame(pathogen_vs_host_db %>% group_by(Species) %>% summarise(count=length(HostSpeciesPHB)))
rownames(host_species_counts) <- host_species_counts$Species
unique_pathogens$N.hosts <- host_species_counts[unique_pathogens$Species,"count"]
# Gather data
pathogen_range_histo <- data.frame( phb = 1:50/30-0.03,
bacteria = hist(filter(unique_pathogens, Type == "Bacteria")$PHB, breaks = 0:50/30, freq = TRUE, plot = FALSE)$count,
virus = hist(filter(unique_pathogens, Type == "Virus")$PHB, breaks = 0:50/30, freq = TRUE, plot = FALSE)$count) %>% gather(type, count, -phb)
pathogen_range_histo$count.norm <- pathogen_range_histo$count/sum(pathogen_range_histo$count)
pathogen_range_histo$type <- ifelse(pathogen_range_histo$type=="bacteria", "Bacteria", "Virus")
pathogen_range_histo.total <- data.frame(pathogen_range_histo %>% group_by(phb) %>% summarise(count=sum(count)))
# Pathogen range histogram, excluding zeros
pathogen_range_histo$type <- ordered(pathogen_range_histo$type, levels=c("Virus", "Bacteria"))
levels(pathogen_range_histo$type) <- c("(a) Viruses", "(b) Bacteria")
p.histogram.figure.2 <- ggplot(pathogen_range_histo, aes(fill=type, x=phb, y=count))+
geom_histogram(colour="black", aes(fill="All"), data=pathogen_range_histo.total, stat="identity", alpha=0.4)+
ylim(c(0,75))+geom_bar(stat="identity")+
facet_wrap(~type, ncol=1)+
scale_fill_manual(values=c("#de2d26", "#252525", "#f7f7f7"))+
xlim(c(0,1))+theme_basic()+
xlab("Mean PHB")+
ylab("Frequency")+
labs(fill="Pathogen type")+
geom_vline(xintercept = median(filter(unique_pathogens, Type == "Bacteria")$PHB[which(filter(unique_pathogens, Type == "Bacteria")$PHB!=0)]),
colour="white", size=2, alpha=0.8)+
geom_vline(xintercept = median(filter(unique_pathogens, Type == "Bacteria")$PHB[which(filter(unique_pathogens, Type == "Bacteria")$PHB!=0)]),
colour="black", size=1, linetype="dashed")+ # Second to make colour clear
geom_vline(xintercept = median(filter(unique_pathogens, Type == "Virus")$PHB[which(filter(unique_pathogens, Type == "Virus")$PHB!=0)]),
colour="white", size=2, alpha=0.8)+
geom_vline(xintercept = median(filter(unique_pathogens, Type == "Virus")$PHB[which(filter(unique_pathogens, Type == "Virus")$PHB!=0)]),
colour="red", size=1, linetype="dashed")+ # Second to make colour clear
theme(axis.text=element_text(size=16),
legend.text =element_text(colour="black", size=16),
legend.title=element_text(colour="black", size=18),
strip.text.x=element_text(colour="black", size=22),
axis.title=element_text(colour="black", size=22))+
theme(legend.position="none")
# Perform Wilcox test (for the caption of the figure)
wilcox.test(unique_pathogens[which(unique_pathogens$PHB>0 & unique_pathogens$Type=="Bacteria"),"PHB"], unique_pathogens[which(unique_pathogens$PHB>0 & unique_pathogens$Type=="Virus"),"PHB"])
##
## Wilcoxon rank sum test with continuity correction
##
## data: unique_pathogens[which(unique_pathogens$PHB > 0 & unique_pathogens$Type == and unique_pathogens[which(unique_pathogens$PHB > 0 & unique_pathogens$Type == "Bacteria"), "PHB"] and "Virus"), "PHB"]
## W = 105382, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
median(unique_pathogens[which(unique_pathogens$PHB>0 & unique_pathogens$Type=="Bacteria"),"PHB"])
## [1] 0.5199371
median(unique_pathogens[which(unique_pathogens$PHB>0 & unique_pathogens$Type=="Virus"),"PHB"])
## [1] 0.4091278
# Show plot
p.histogram.figure.2
# Save figure
pdf("figures/Figure-2-histogram-PHB.pdf", width=12, height=8)
p.histogram.figure.2
dev.off()
## quartz_off_screen
## 2
png("figures/Figure-2-histogram-PHB.png", width=1200, height=800)
p.histogram.figure.2
dev.off()
## quartz_off_screen
## 2
options(warn=-1) # Turn off warnings
# Function to plot PHB histogram for a given dataset
plotHistogram <- function(association_dataset, cophenetic_distances){
# Get unique pathogens
unique_pathogens.subsampled <- data.frame(Species=association_dataset[!duplicated(association_dataset$Species),"Species"],
Type=association_dataset[!duplicated(association_dataset$Species),"Type"])
unique_pathogens.subsampled$Species <- as.character(unique_pathogens.subsampled$Species)
# Calculate PHB
unique_pathogens.subsampled$PHB <- sapply(unique_pathogens.subsampled$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_distances, m = association_dataset))
unique_pathogens.subsampled$PHB.median <- sapply(unique_pathogens.subsampled$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_distances, m = association_dataset, FUN=median))
unique_pathogens.subsampled$PHB.max <- sapply(unique_pathogens.subsampled$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_distances, m = association_dataset, FUN = max))
# Number of total hosts
host_species_counts <- data.frame(association_dataset %>% group_by(Species) %>% summarise(count=length(HostSpeciesPHB)))
rownames(host_species_counts) <- host_species_counts$Species
unique_pathogens.subsampled$N.hosts <- host_species_counts[unique_pathogens.subsampled$Species,"count"]
# Gather data
pathogen_range_histo <- data.frame(phb = 1:50/30-0.03,
bacteria = hist(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB, breaks = 0:50/30, freq = TRUE, plot = FALSE)$count,
virus = hist(filter(unique_pathogens.subsampled, Type == "Virus")$PHB, breaks = 0:50/30, freq = TRUE, plot = FALSE)$count) %>% pivot_longer(cols=c("bacteria", "virus"), names_to = "type", values_to = "count")
pathogen_range_histo$count.norm <- pathogen_range_histo$count/sum(pathogen_range_histo$count)
pathogen_range_histo$type <- ifelse(pathogen_range_histo$type=="bacteria", "Bacteria", "Virus")
pathogen_range_histo.total <- data.frame(pathogen_range_histo %>% group_by(phb) %>% summarise(count=sum(count)))
# Pathogen range histogram, excluding zeros
pathogen_range_histo$type <- ordered(pathogen_range_histo$type, levels=c("Virus", "Bacteria"))
levels(pathogen_range_histo$type) <- c("(a) Viruses", "(b) Bacteria")
p.histogram <- ggplot(pathogen_range_histo, aes(fill=type, x=phb, y=count))+
geom_histogram(colour="black", aes(fill="All"), data=pathogen_range_histo.total, stat="identity", alpha=0.4)+
ylim(c(0,75))+geom_bar(stat="identity")+
facet_wrap(~type, ncol=1)+
scale_fill_manual(values=c("#de2d26", "#252525", "#f7f7f7"))+
xlim(c(0,1))+theme_basic()+
xlab("Mean PHB")+
ylab("Frequency")+
labs(fill="Pathogen type")+
geom_vline(xintercept = median(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB[which(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB!=0)]),
colour="white", size=2, alpha=0.8)+
geom_vline(xintercept = median(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB[which(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB!=0)]),
colour="black", size=1, linetype="dashed")+ # Second to make colour clear
geom_vline(xintercept = median(filter(unique_pathogens.subsampled, Type == "Virus")$PHB[which(filter(unique_pathogens.subsampled, Type == "Virus")$PHB!=0)]),
colour="white", size=2, alpha=0.8)+
geom_vline(xintercept = median(filter(unique_pathogens.subsampled, Type == "Virus")$PHB[which(filter(unique_pathogens.subsampled, Type == "Virus")$PHB!=0)]),
colour="red", size=1, linetype="dashed")+ # Second to make colour clear
theme(axis.text=element_text(size=16),
legend.text =element_text(colour="black", size=16),
legend.title=element_text(colour="black", size=18),
strip.text.x=element_text(colour="black", size=22),
axis.title=element_text(colour="black", size=22))+
theme(legend.position="none")
return(p.histogram)
}
options(warn=-1) # Turn off warnings
pathogen_vs_host_db_domestic <- pathogen_vs_host_db[which(pathogen_vs_host_db$Domestic=="Yes" & pathogen_vs_host_db$HostSpeciesPHB!="Homosapiens"),]
cophenetic_domestic <- cophenetic_matrix[which(rownames(cophenetic_matrix) %in% pathogen_vs_host_db_domestic$HostSpeciesPHB), which(colnames(cophenetic_matrix) %in% pathogen_vs_host_db_domestic$HostSpeciesPHB)]
p.PHB.domestic <- plotHistogram(pathogen_vs_host_db_domestic, cophenetic_domestic)
options(warn=-1) # Turn off warnings
pathogen_vs_host_db_non_domestic <- pathogen_vs_host_db[which(pathogen_vs_host_db$Domestic=="No" & pathogen_vs_host_db$HostSpeciesPHB!="Homosapiens"),]
cophenetic_non_domestic <- cophenetic_matrix[which(rownames(cophenetic_matrix) %in% pathogen_vs_host_db_non_domestic$HostSpeciesPHB), which(colnames(cophenetic_matrix) %in% pathogen_vs_host_db_non_domestic$HostSpeciesPHB)]
p.PHB.nondomestic <- plotHistogram(pathogen_vs_host_db_non_domestic, cophenetic_non_domestic)
cowplot::plot_grid(p.PHB.domestic+ggtitle("Domestic (n=2,270)"), p.PHB.nondomestic + ggtitle("Non-domestic (n=5,898)"))
We want to produce an analagous version of Figure 1 from Olival et al. (2017) with our data.
options(warn=-1) # Turn off warnings
# First make a dataset per order of mammals
# Note that we have Artiodactyla where Olival et al. have "Cetartiodactyla" - have changed this
mammal.orders <- stringi::stri_trans_totitle(c( "CINGULATA", "PILOSA","DIDELPHIMORPHIA", "EULIPOTYPHLA",
"CHIROPTERA", "PRIMATES", "RODENTIA", "CARNIVORA", "LAGOMORPHA", "PROBOSCIDEA", "DIPROTODONTIA",
"ARTIODACTYLA", "PERISSODACTYLA", "PERAMELEMORPHIA", "SCANDENTIA"))
pathogen_vs_mammal_db <- pathogen_vs_host_db_no_humans[which(pathogen_vs_host_db_no_humans$HostOrder %in% mammal.orders),]
human_pathogens <- pathogen_vs_host_db$Species[which(pathogen_vs_host_db$HostSpeciesPHB=="Homosapiens")]
pathogen_vs_mammal_db$pathogenSharedWithHumans <- ifelse(pathogen_vs_mammal_db$Species %in% human_pathogens,
"yes", "no")
boxplot.df <- data.frame(pathogen_vs_mammal_db %>%
group_by(HostSpeciesPHB, HostOrder, Type) %>%
summarise(nPathogens=length(Species),
nHumanPathogens=sum(pathogenSharedWithHumans == "yes"),
nWildMammalHosts=sum(Domestic == "No")))
boxplot.df$propHumanPathogens <- boxplot.df$nHumanPathogens/boxplot.df$nPathogens
boxplot.df$propWildMammalHosts <- boxplot.df$nWildMammalHosts/boxplot.df$nPathogens
# Add information on domesticity: is host species domestic or not?
# Need to check this as they can be different for each association
# i.e. is a species completely wild, completely domestic, or a mixture?
host.df <- pathogen_vs_host_db[which(!duplicated(pathogen_vs_host_db$HostSpeciesPHB)),]
host.df.domestic <- host.df$Domestic
names(host.df.domestic) <- host.df$HostSpeciesPHB
domestic.mammals <- names(host.df.domestic)[which(host.df.domestic=="Yes")]
wild.mammals <- names(host.df.domestic)[which(host.df.domestic=="No")]
captive.mammals <- names(host.df.domestic)[which(host.df.domestic=="Captive")]
# Order hosts using Olival et al. ordering
boxplot.df$HostOrder <- ordered(boxplot.df$HostOrder, levels=mammal.orders)
# Proportion of zoonotic pathogens for viruses
p.virus.zoonotic <- ggplot(data=boxplot.df[which(boxplot.df$Type=="Virus"),], aes( HostOrder, propHumanPathogens))+
geom_boxplot(outlier.shape = NA, fill="#fee5d9")+
theme_basic()+coord_flip()+
geom_jitter(aes(fill=propWildMammalHosts), colour="black", shape=21, width=0.15, height = 0, size=2)+
scale_fill_continuous(low="white", high="red")+
scale_x_discrete(drop=FALSE)+
theme(legend.position = "none")+
ylab("")+xlab("")+
ylab("Proportion of zoonotic viruses")+
ggtitle("(a)")+
theme(plot.title=element_text(hjust=0))
p.virus.richness <- ggplot(data=boxplot.df[which(boxplot.df$Type=="Virus"),], aes( HostOrder, nPathogens))+
geom_boxplot(outlier.shape = NA, fill="#fee5d9")+
theme_basic()+coord_flip()+
geom_jitter(aes(fill=propWildMammalHosts), colour="black", shape=21, width=0.15, height = 0, size=2)+
scale_fill_continuous(low="white", high="red", limits=c(0,1))+
ylab("")+xlab("")+
theme(legend.position = "none")+
theme(axis.text.y=element_blank())+
ylab("Total viral richness")+
ggtitle("(b)")+
theme(plot.title=element_text(hjust=0))+
ylim(c(0,60))+
labs(fill="Proportion of wild hosts")+
theme(legend.position = c(0.7,0.15),
legend.title = element_text(size=14),
legend.background = element_rect(colour="black"))
p.bacteria.zoonotic <- ggplot(data=boxplot.df[which(boxplot.df$Type=="Bacteria"),], aes( HostOrder, propHumanPathogens))+
geom_boxplot(outlier.shape = NA, fill="#f7f7f7")+
theme_basic()+coord_flip()+
geom_jitter(aes(fill=propWildMammalHosts), colour="black", shape=21, width=0.15, height = 0, size=2)+
scale_x_discrete(drop=FALSE)+
scale_fill_continuous(low="white", high="black")+
ylab("")+xlab("")+
theme(legend.position = "none")+
ylab("Proportion of zoonotic bacteria")+
ggtitle("(c)")+
theme(plot.title=element_text(hjust=0))
p.bacteria.richness <- ggplot(data=boxplot.df[which(boxplot.df$Type=="Bacteria"),], aes( HostOrder, nPathogens))+
geom_boxplot(outlier.shape = NA, fill="#f7f7f7")+
theme_basic()+coord_flip()+
geom_jitter(aes(fill=propWildMammalHosts), colour="black", shape=21, width=0.15, height = 0, size=2)+
scale_fill_continuous(low="white", high="black", limits=c(0,1))+
ylab("")+xlab("")+
scale_x_discrete(drop=FALSE)+
theme(legend.position = "none")+
theme(axis.text.y=element_blank())+
ylab("Total bacterial richness")+
ggtitle("(d)")+
theme(plot.title=element_text(hjust=0))+
ylim(c(0,60))+
labs(fill="Proportion of wild hosts")+
theme(legend.position = c(0.7,0.15),
legend.title = element_text(size=14),
legend.background = element_rect(colour="black"))
pdf(file="figures/Figure-3-richness-per-species.pdf", width=20, height=9)
cowplot::plot_grid(p.virus.zoonotic, p.virus.richness,
p.bacteria.zoonotic, p.bacteria.richness, nrow=1)
dev.off()
## quartz_off_screen
## 2
png(file="figures/Figure-3-richness-per-species.png", width=2000, height=900)
cowplot::plot_grid(p.virus.zoonotic, p.virus.richness,
p.bacteria.zoonotic, p.bacteria.richness, nrow=1)
dev.off()
## quartz_off_screen
## 2
# Show plot
cowplot::plot_grid(p.virus.zoonotic, p.virus.richness,
p.bacteria.zoonotic, p.bacteria.richness, nrow=1)
# Correlation of bacterial and viral richness per host order
boxplot.df.bacteria <- boxplot.df[which(boxplot.df$Type=="Bacteria"),]
boxplot.df.virus <- boxplot.df[which(boxplot.df$Type=="Virus"),]
# Shared hosts i.e. have both bacteria and virus associations
shared.hosts <- boxplot.df.virus$HostSpeciesPHB[which(boxplot.df.virus$HostSpeciesPHB %in% boxplot.df.bacteria$HostSpeciesPHB)]
boxplot.df.shared <- boxplot.df[which(boxplot.df$HostSpeciesPHB %in% shared.hosts),]
# Now they are sorted, so can easily separate
mammal.bacterial.viral.richness <- data.frame(species=shared.hosts,
nBacteria=boxplot.df.shared$nPathogens[which(boxplot.df.shared$Type=="Bacteria")],
propBacteriaSharedHumans=boxplot.df.shared$propHumanPathogens[which(boxplot.df.shared$Type=="Bacteria")],
nVirus=boxplot.df.shared$nPathogens[which(boxplot.df.shared$Type=="Virus")],
propVirusSharedHumans=boxplot.df.shared$propHumanPathogens[which(boxplot.df.shared$Type=="Virus")])
cor.test(mammal.bacterial.viral.richness$nBacteria, mammal.bacterial.viral.richness$nVirus)
##
## Pearson's product-moment correlation
##
## data: mammal.bacterial.viral.richness$nBacteria and mammal.bacterial.viral.richness$nVirus
## t = 35.68, df = 357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8587839 0.9045045
## sample estimates:
## cor
## 0.8837353
First we do GAMs for host traits which predict viral and bacterial richness.
options(warn=-1) # Turn off warnings
source('scripts/04-fit-GAMs-host-traits.R')
source('scripts/05-plot-GAMs-host-traits.R')
allplots
Then GAMs for predicting zoonotic potential.
detachAllPackages <- function() {
basic.packages <- c("package:stats","package:graphics","package:grDevices","package:utils","package:datasets","package:methods","package:base")
package.list <- search()[ifelse(unlist(gregexpr("package:",search()))==1,TRUE,FALSE)]
package.list <- setdiff(package.list,basic.packages)
if (length(package.list)>0) for (package in package.list) detach(package, character.only=TRUE)
}
detachAllPackages()
options(warn=-1)
library(stringi)
library(parallel)
library(magrittr)
library(purrr)
library(mgcv)
library(ggplot2)
library(cowplot)
library(viridis)
library(svglite)
library(phangorn)
library(MuMIn)
library(dplyr)
source('scripts/06-fit-GAM-viral-zoonotic-potential.R')
source('scripts/07-fit-GAM-bacterial-zoonotic-potential.R')
allplots = cowplot::plot_grid(allplots_viral, allplots_bacterial, nrow=2, rel_widths = c(5.3, 5.3, 5.3))
# Save plots
png(file="figures/Figure-5-gams-zoonotic-potential.png", width = convertr::convert(183, "mm", "in")*500, convertr::convert(100, "mm", "in")*600, pointsize=7, res=600)
allplots
dev.off()
## quartz_off_screen
## 2
pdf(file="figures/Figure-5-gams-zoonotic-potential.pdf", width =7, height=5, pointsize=7)
allplots
dev.off()
## quartz_off_screen
## 2
# Show plot
allplots
We make a plot of the host switching for bacteria and viruses together (left panel of figure). We restrict
hostShiftingResultsOrder <- function(type="Bacteria", min.rep=0){
metadata.type <- metadata[which(metadata$Type %in% type),]
# Get host orders, and keep only those with more than min.rep representatives
type.host.orders.t <- table(metadata.type$HostOrder)
type.host.orders <- names(type.host.orders.t[type.host.orders.t>min.rep])
# Prepare results to fill
results <- matrix(nrow=0, ncol=8)
# For each host order
for (reference.order in type.host.orders){
reference.order.pathogens <- unique(metadata.type[which(metadata.type$HostOrder==reference.order), "Species"])
reference.order.species <- unique(as.character(metadata.type[which(metadata.type$HostOrder==reference.order), "HostSpeciesPHB"]))
# For each host order, return percentage of pathogens shared with host order vs. mean distance
for (comparison.order in type.host.orders){
# All pathogens in that host order
pathogens <- metadata.type[which(metadata.type$HostOrder==comparison.order),"Species"]
# Percentage of those pathogens that are shared with reference order
perc.shared <- length(pathogens[pathogens %in% reference.order.pathogens])/length(pathogens)
perc.shared.unique <- length(unique(pathogens)[unique(pathogens) %in% reference.order.pathogens])/length(unique(c(pathogens, reference.order.pathogens)))
# Mean of all host order in order from reference order
comparison.order.species <- unique(as.character(metadata.type[which(metadata.type$HostOrder==comparison.order),"HostSpeciesPHB"]))
mean.distance <- mean(d[comparison.order.species, reference.order.species])
sd.distance <- sd(d[comparison.order.species, reference.order.species])
# Add to results
if (reference.order==comparison.order){
mean.distance <- NA
sd.distance <- NA
perc.shared <- NA
perc.shared.unique <- NA
}
results <- rbind(results, c(reference.order,
comparison.order,
mean.distance,
sd.distance,
perc.shared,
perc.shared.unique,
length(reference.order.species),
length(comparison.order.species) ))
}
}
# Naming of data frame, add names of host orders
results <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results) <- c("order.1",
"order.2",
"mean.distance",
"sd.distance",
"perc.shared",
"perc.shared.unique",
"n.associations.order.1",
"n.associations.order.2")
results$perc.shared <- as.numeric(results$perc.shared)
results$perc.shared.unique <- as.numeric(results$perc.shared.unique)
results$mean.distance <- as.numeric(results$mean.distance)
results$sd.distance <- as.numeric(results$sd.distance)
return(results)
}
hostShiftingResultsFamily <- function(type="Bacteria", min.rep=0){
metadata.type <- metadata[which(metadata$Type %in% type),]
# Get host families, and keep only those with more than min.rep representatives
type.host.families.t <- table(metadata.type$HostFamily)
type.host.families <- names(type.host.families.t[type.host.families.t>min.rep])
# Get list of host.species
host.table <- table(metadata.type$HostSpeciesPHB)
host.species <- names(host.table)[which(host.table>min.rep)]
results <- matrix(nrow=0, ncol=8)
# For each host family
for (reference.family in type.host.families){
reference.family.pathogens <- unique(metadata.type[which(metadata.type$HostFamily==reference.family), "Species"])
reference.family.species <- unique(as.character(metadata.type[which(metadata.type$HostFamily==reference.family), "HostSpeciesPHB"]))
# For each host order, return percentage of pathogens shared with host family vs. mean distance
for (comparison.family in type.host.families){
# All pathogens in that host family
pathogens <- metadata.type[which(metadata.type$HostFamily==comparison.family),"Species"]
# Percentage of those pathogens that are shared with reference order
perc.shared <- length(pathogens[pathogens %in% reference.family.pathogens])/length(pathogens)
perc.shared.unique <- length(unique(pathogens)[unique(pathogens) %in% reference.order.pathogens])/length(unique(c(pathogens, reference.order.pathogens)))
# Mean of all host order in order from reference order
comparison.family.species <- unique(as.character(metadata.type[which(metadata.type$HostFamily==comparison.family),"HostSpeciesPHB"]))
mean.distance <- mean(d[comparison.family.species, reference.family.species])
sd.distance <- sd(d[comparison.family.species, reference.family.species])
# Add to results
results <- rbind(results, c(reference.family,
comparison.family,
mean.distance,
sd.distance,
perc.shared,
perc.shared.unique,
length(metadata.type[which(metadata.type$HostFamily==reference.family), "HostSpeciesPHB"]),
length(metadata.type[which(metadata.type$HostFamily==comparison.family), "HostSpeciesPHB"]) ))
}
}
# Naming of data frame, add names of host orders
results <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results) <- c("family.1",
"family.2",
"mean.distance",
"sd.distance",
"perc.shared",
"perc.shared.unique",
"n.associations.family.1",
"n.associations.order.2")
results$perc.shared <- as.numeric(results$perc.shared)
results$perc.shared.unique <- as.numeric(results$perc.shared.unique)
results$mean.distance <- as.numeric(results$mean.distance)
results$sd.distance <- as.numeric(results$sd.distance)
return(results)
}
hostShiftingResultsSpecies <- function(type="Bacteria", min.rep=0){
metadata.type <- metadata[which(metadata$Type %in% type),]
# Get host orders, and keep only those with more than min.rep representatives
type.host.species.t <- table(metadata.type$HostSpeciesPHB)
type.host.species <- names(type.host.species.t[type.host.species.t>min.rep])
# Get list of host.species
host.species <- type.host.species
results <- matrix(nrow=0, ncol=8)
# For each host species
for (reference.species in host.species){
reference.species.pathogens <- unique(metadata.type[which(metadata.type$HostSpeciesPHB==reference.species), "Species"])
# For each host species, return percentage of pathogens shared with host species vs. mean distance
for (comparison.species in host.species){
# All pathogens in that host species
pathogens <- metadata.type[which(metadata.type$HostSpeciesPHB==comparison.species),"Species"]
# Percentage of those pathogens that are shared with reference species
perc.shared <- length(pathogens[pathogens %in% reference.species.pathogens])/length(pathogens)
perc.shared.unique <- length(unique(pathogens)[unique(pathogens) %in% reference.species.pathogens])/length(unique(c(pathogens, reference.species.pathogens)))
# Mean of all host species in order from reference order
mean.distance <- mean(d[comparison.species, reference.species])
sd.distance <- sd(d[comparison.species, reference.species])
# Add to results
results <- rbind(results, c(reference.species,
comparison.species,
mean.distance,
sd.distance,
perc.shared,
perc.shared.unique,
length(reference.species),
length(comparison.species) ))
}
}
# Naming of data frame, add names of host orders
results <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results) <- c("species.1",
"species.2",
"mean.distance",
"sd.distance",
"perc.shared",
"perc.shared.unique",
"n.hosts.species.1",
"n.hosts.species.2")
results$perc.shared <- as.numeric(results$perc.shared)
results$perc.shared.unique <- as.numeric(results$perc.shared.unique)
results$mean.distance <- as.numeric(results$mean.distance)
results$sd.distance <- as.numeric(results$sd.distance)
return(results)
}
hostShiftingPlot <- function(results, title.string=""){
# Correlation test
correlation <- cor.test(results$mean.distance, results$perc.shared, method = "spearman")
# Label for correlation statistics
corr.label <- data.frame(
mean.distance = 0,
perc.shared = 0.1,
label = paste("Spearman's rho = ", myround(correlation$estimate, 3),
"\n p = ", myround(correlation$p.value, 3)),
host.order="",
n.hosts="NA*")
# Plot results
p <- ggplot(results, aes(x=mean.distance, y=perc.shared.unique))+
geom_point( fill="black", alpha=0.2)+
stat_smooth(method="loess", aes(group=1), se = FALSE, size=2, alpha=0.5)+
xlim(c(0,max(results$mean.distance)))+
ylim(c(0,1))+
xlab(paste("Mean phylogenetic distance"))+
ylab(paste("Fraction of shared pathogens"))+
theme_basic()+
labs(size="Number of host-pathogen associations")+
theme(axis.text=element_text(size=16),
legend.text =element_text(colour="black", size=16),
legend.title=element_text(colour="black", size=18),
strip.text.x=element_text(colour="black", size=22),
axis.title=element_text(colour="black", size=22))
return(p)
}
metadata <- pathogen_vs_host_db
d <- cophenetic_matrix
results.orders.10 <- hostShiftingResultsOrder(type=c("Bacteria", "Virus"), min.rep=10)
p.pathogen.sharing.orders.10 <- hostShiftingPlot(results.orders.10)+xlab("Mean phylogenetic distance between orders")
results.orders.5 <- hostShiftingResultsOrder(type=c("Bacteria", "Virus"), min.rep=5)
p.pathogen.sharing.orders.5 <- hostShiftingPlot(results.orders.5)
results.species.10 <- hostShiftingResultsSpecies(type=c("Bacteria", "Virus"), min.rep=10)
# plot, but exclude comparisons of species with themselves, and the outliers
p.pathogen.sharing.species.10 <- hostShiftingPlot(results.species.10[which(results.species.10$mean.distance!=0 &
results.species.10$mean.distance<1.8),])
We also carry out a Mantel test for correlation of the two distance matrices.
mean.dist <- acast(species.1 ~ species.2, value.var = "mean.distance", data=results.species.10)
perc.shared <- acast(species.1 ~ species.2, value.var = "perc.shared.unique", data=results.species.10)
mantel.test(mean.dist, perc.shared)
## $z.stat
## [1] 303.546
##
## $p
## [1] 0.001
##
## $alternative
## [1] "two.sided"
We construct the right panel of the figure. We make a separate data frame for bacteria and for viruses, then combine these together and plot the results.
host.orders <- table(metadata$HostOrder)
min.rep.10 <- names(host.orders)[host.orders>10]
hostShiftingPlotDataHuman <- function(type=c("Bacteria", "Virus"), reference.species="Homosapiens", host.orders = min.rep.10, title.string=""){
# Limit to only pathogens of interest (bacteria, viruses, or both)
metadata.type <- metadata[which(metadata$Type %in% type),]
# Get host orders, and keep only those with more than min.rep representatives
type.host.orders <- min.rep.10
# Get list of reference.species pathogens
reference.species.pathogens <- unique(metadata.type[which(metadata.type$HostSpeciesPHB==reference.species), "Species"])
# For each host order, return percentage of pathogens shared with humans vs. mean distance of host species from reference species
results <- matrix(nrow=0, ncol=3)
for (b in type.host.orders){
# All pathogens in that host order, excluding the reference species
pathogens <- unique(metadata.type[which(metadata.type$HostOrder==b & metadata.type$HostSpeciesPHB!=reference.species),"Species"])
# Percentage of those pathogens that are shared with humans
perc.shared <- length(pathogens[pathogens %in% reference.species.pathogens])/length(pathogens)
# Mean of all host species in order from reference species. Note that we consider all species,
# not just those that have a pathogen association of the right type
host.species <- as.character(metadata[which(metadata$HostOrder==b),"HostSpeciesPHB"])
mean.distance <- mean(d[host.species, reference.species])
# Add to results
results <- rbind(results, c(perc.shared, mean.distance, length(unique(host.species))))
}
# Naming of data frame, add names of host orders
results <- as.data.frame(results)
colnames(results) <- c("perc.shared", "mean.distance", "n.hosts")
results$host.order <- type.host.orders
# Correlation test
correlation <- cor.test(results$mean.distance, results$perc.shared, method = "spearman")
# Label for correlation statistics
corr.label <- data.frame(
mean.distance = 0,
perc.shared = 0.1,
label = paste("Spearman's rho = ", myround(correlation$estimate, 3),
"\n p = ", myround(correlation$p.value, 3)),
host.order="",
n.hosts="NA*")
# Return results
return(results)
}
human.sharing.bacteria <- hostShiftingPlotDataHuman("Bacteria")
human.sharing.bacteria$type <- "Bacteria"
human.sharing.virus <- hostShiftingPlotDataHuman("Virus")
human.sharing.virus$type <- "Virus"
plot.df <- rbind(human.sharing.bacteria, human.sharing.virus)
# The parameter starting values for the sigmoidal fit are from fitting the sigmoid the bacteria data only with nls:
# fit <- nls(perc.shared ~ SSlogis(mean.distance, Asym, xmid, scal), data = human.sharing.bacteria)
p.human.sharing <- ggplot(plot.df, aes(x=mean.distance, y=perc.shared))+
geom_point(aes(size=n.hosts, fill=type), colour="black", pch=21, alpha=0.5)+
xlim(c(0,max(plot.df$mean.distance)))+
geom_smooth(method="nls", aes(group=type, colour=type),size=2, formula= y ~ Asym/(1+exp((xmid-x)/scal)), method.args = list(start=c(Asym=0.7, xmid=1.31, scal=-0.33)), se=FALSE)+
scale_fill_manual(values=c("black", "red"))+
scale_colour_manual(values=c("black", "red"))+
scale_size_continuous(range = c(2,10))+
ylim(c(0,1))+
labs(size="Number of\nhost-pathogen\nassociations", fill="Pathogen type", colour="Pathogen type")+
xlab(paste("Mean phylogenetic distance to Homo sapiens"))+
ylab(paste("Fraction of shared pathogens with Homo sapiens"))+
theme_basic()+
theme(axis.text=element_text(size=16),
legend.text =element_text(colour="black", size=16),
legend.title=element_text(colour="black", size=18),
strip.text.x=element_text(colour="black", size=22),
axis.title=element_text(colour="black", size=22))
We combine the left and right panels together to make a preliminary version of the figure. The paper version was produced manually from this preliminary version in Inkscape, using animal silhouettes from phylopic.org.
pdf("figures/Figure-6-pathogen-sharing.pdf", width=17, height=10)
cowplot::plot_grid(p.pathogen.sharing.orders.10+ggtitle("(a)"), p.human.sharing+ggtitle("(b)"))
dev.off()
## quartz_off_screen
## 2
png("figures/Figure-6-pathogen-sharing.png", width=1700, height=1000, pointsize = 14)
cowplot::plot_grid(p.pathogen.sharing.orders.10, p.human.sharing)
dev.off()
## quartz_off_screen
## 2
# Show plot
cowplot::plot_grid(p.pathogen.sharing.orders.10, p.human.sharing)
It is useful to be able to match the points and names together, which can be done with the plot below.
ggplot(plot.df, aes(x=mean.distance, y=perc.shared))+
geom_text(aes(label=host.order, colour=type), nudge_x = 0,nudge_y=0, size=1, angle=45)+
stat_smooth(method="loess", aes(group=type, colour=type),size=2, se = FALSE, formula = y ~ exp(x)/(exp(x)+1))+
xlim(c(0,max(plot.df$mean.distance)))+
scale_fill_manual(values=c("black", "red"))+
scale_colour_manual(values=c("black", "red"))+
scale_size_continuous(range = c(2,10))+
ylim(c(0,1))+
labs(size="Number of\nhost-pathogen\nassociations", fill="Pathogen type", colour="Pathogen type")+
xlab(paste("Mean phylogenetic distance to Homo sapiens"))+
ylab(paste("Fraction of shared pathogens with Homo sapiens"))+
theme_basic()+geom_point(aes(colour=type))+facet_wrap(~type)
ggsave("figures/Figure-6-pathogen-sharing-right-panel-names.pdf")
## Saving 12 x 8 in image
We compare our mitochondrial gene phylogeny with the cytb phylogeny of Olival et al.
olival.tree <- read.tree('data/olival_cytb_supertree.tree')
# Convert tip labels
olival.tree$tip.label <- gsub("_", "", olival.tree$tip.label)
# Only consider shared species
olival.tree.shared <- drop.tip(olival.tree,
tip=olival.tree$tip.label[which(!olival.tree$tip.label %in% host_tree$tip.label)])
olival.tree.cophenetic <- cophenetic(olival.tree.shared)
this.study.tree.cophenetic <- d[rownames(olival.tree.cophenetic), rownames(olival.tree.cophenetic)]
# Plot cophenetic distances
tree.df <- data.frame(cbind(as.vector(olival.tree.cophenetic), as.vector(this.study.tree.cophenetic)))
p.olival.cophenetic.comparison <- ggplot(tree.df, aes(X1, X2))+
geom_point(size=0.25, alpha=0.1)+
theme_basic()+
xlab("Cophenetic distance (Olival et al. 2017)")+
ylab("Cophenetic distance (this study)")+
geom_abline(linetype='dashed')+
stat_smooth(method="lm", colour="red")
p.olival.cophenetic.comparison
png("figures/Supplementary-Figure-1-Cophenetic-Tree-Comparison.png", width=800, height=1200)
p.olival.cophenetic.comparison
dev.off()
## quartz_off_screen
## 2
Mean and maximum PHB are well-correlated for most pathogens.
p.mean.max.phb <- ggplot(unique_pathogens, aes(x=PHB, y=PHB.max))+geom_point(aes(size=N.hosts, colour=Type))+theme_basic()+
facet_wrap(~Type)+
scale_color_manual(values=c("black", "red"))+
geom_abline(slope=1, intercept=0)
p.mean.max.phb
pdf("figures/Supplementary-Figure-2-PHB-correlation.pdf", width=12, height=8)
p.mean.max.phb
dev.off()
## quartz_off_screen
## 2
Shows how PHB varies within bacterial families.
# Family
min.abundance <- 20
df.generalist.prop.family <- data.frame(bacteria.df %>% group_by(family) %>% summarise(prop.generalist=length(n.hosts[which(n.hosts!=1)])/length(n.hosts), species=length(n.hosts)))
df.generalist.prop.family$family <- ordered(df.generalist.prop.family$family, levels=df.generalist.prop.family$family[order(df.generalist.prop.family$prop.generalist, decreasing=TRUE)])
p.bacteria.family.generalist <- ggplot(df.generalist.prop.family[which(df.generalist.prop.family$species>min.abundance & df.generalist.prop.family$prop.generalist!=0),], aes(family, prop.generalist, size=species))+
geom_point()+
theme_basic()+
xlab("")+
ylab("Proportion of generalists")+
labs(size="Number\nof species")+
ylim(c(0,1))+
coord_flip()+
theme(legend.position = "none")+
ggtitle("(a) Proportion of generalists")+
theme(plot.margin=unit(c(1,0.5,1,1), "cm"))+
theme(legend.position = "none")
# Get these top families
top.families <- as.character(df.generalist.prop.family$family[which(df.generalist.prop.family$species>min.abundance)])
plot.df <- bacteria.df[which(bacteria.df$family %in% top.families & bacteria.df$bacteria.phbs!=0),]
shared.families <- unique(plot.df$family[which(plot.df$family %in% top.families)])
plot.df$family <- ordered(plot.df$family, levels=levels(df.generalist.prop.family$family))
p.bacteria.family.phb <- ggplot(plot.df, aes(family, bacteria.phbs))+
geom_boxplot(outlier.shape=NA)+
geom_jitter(width=0.05, height=0, pch=21, fill='white', colour='black')+
theme_basic()+
xlab("")+
ylab("PHB")+
ylim(c(0,1))+
coord_flip()+
theme(axis.text.y=element_blank())+
ggtitle("(b) PHB of generalists")+
theme(plot.margin=unit(c(1,1,1,-0.5), "cm"))
cowplot::plot_grid(p.bacteria.family.generalist,
p.bacteria.family.phb,
ncol=2)
pdf("figures/Supplementary-Figure-3-bacterial-families.pdf", width=12, height=8)
cowplot::plot_grid(p.bacteria.family.generalist,
p.bacteria.family.phb,
ncol=2)
dev.off()
## quartz_off_screen
## 2
The same approach as Supplementary Figure 3, but now for viral families
# Family
min.abundance <- 20
df.generalist.prop.family <- data.frame(virus.df %>% group_by(family) %>% summarise(prop.generalist=length(n.hosts[which(n.hosts!=1)])/length(n.hosts), species=length(n.hosts)))
df.generalist.prop.family$family <- ordered(df.generalist.prop.family$family, levels=df.generalist.prop.family$family[order(df.generalist.prop.family$prop.generalist, decreasing=TRUE)])
p.virus.family.generalist <- ggplot(df.generalist.prop.family[which(df.generalist.prop.family$species>min.abundance & df.generalist.prop.family$prop.generalist>0.05),], aes(family, prop.generalist, size=species))+
geom_point()+
theme_basic()+
xlab("")+
ylab("Proportion of generalists")+
labs(size="Number\nof species")+
ylim(c(0,1))+
coord_flip()+
ggtitle("(a) Proportion of generalists")+
theme(plot.margin=unit(c(1,0.5,1,1), "cm"))+
theme(legend.position = "none")
# Get these top families
top.families <- as.character(df.generalist.prop.family$family[which(df.generalist.prop.family$species>min.abundance)])
plot.df <- virus.df[which(virus.df$family %in% top.families & virus.df$virus.phbs!=0),]
shared.families <- unique(plot.df$family[which(plot.df$family %in% top.families)])
plot.df$family <- ordered(plot.df$family, levels=levels(df.generalist.prop.family$family))
# family overlap
family.overlap <- unique(as.character(df.generalist.prop.family[which(df.generalist.prop.family$species>min.abundance & df.generalist.prop.family$prop.generalist!=0),"family"]))
p.virus.family.phb <- ggplot(plot.df[which(plot.df$family %in% family.overlap),], aes(family, virus.phbs))+
geom_boxplot(outlier.shape=NA)+
geom_jitter(width=0.05, height=0, pch=21, fill='white', colour='black')+
theme_basic()+
xlab("")+
ylab("PHB")+
coord_flip()+
theme(axis.text.y=element_blank())+
ggtitle("(b) PHB of generalists")+
theme(plot.margin=unit(c(1,1,1,-0.5), "cm"))
cowplot::plot_grid(p.virus.family.generalist,
p.virus.family.phb,
ncol=2)
pdf("figures/Supplementary-Figure-4-viral-families.pdf", width=12, height=8)
cowplot::plot_grid(p.virus.family.generalist,
p.virus.family.phb,
ncol=2)
dev.off()
## quartz_off_screen
## 2
How do genome GC content and size vary in specialist vs. generalist pathogens?
bacteria.unique <- pathogen_vs_host_db[which(pathogen_vs_host_db$Type=="Bacteria" & !duplicated(pathogen_vs_host_db$Species)),]
rownames(bacteria.unique) <- bacteria.unique$Species
bacteria.unique$PHB <- sapply(bacteria.unique$Species, function(x) PHB(x, m=pathogen_vs_host_db))
virus.unique <- pathogen_vs_host_db[which(pathogen_vs_host_db$Type=="Virus" & !duplicated(pathogen_vs_host_db$Species)),]
virus.unique$PHB <- sapply(virus.unique$Species, function(x) PHB(x, m=pathogen_vs_host_db))
bacteria.unique$Specialist <- ordered(ifelse(bacteria.unique$PHB>0, "Generalist", "Specialist"),
levels=c("Specialist", "Generalist"))
virus.unique$Specialist <- ordered(ifelse(virus.unique$PHB>0, "Generalist", "Specialist"),
levels=c("Specialist", "Generalist"))
p.bacteria.GC <- ggplot(bacteria.unique, aes(Specialist, Genome.GC, fill=Specialist))+
geom_violin()+
stat_summary(geom = "point",pch=1, fun.y = "median", size=5)+
stat_summary(geom = "point", pch=2, fun.y = "mean", size=5)+
theme_basic()+
scale_fill_manual(values=c("#f0f0f0", "#636363"))+
ylim(c(20, 80))+
xlab("")+
ylab("GC content (%)")+
theme(legend.position="none")+
ggtitle("Bacteria")+
theme(plot.title=element_text(size=24, hjust=0.5))
p.virus.GC <- ggplot(virus.unique, aes(Specialist, Genome.GC, fill=Specialist))+
geom_violin()+
stat_summary(geom = "point",pch=1, fun.y = "median", size=5)+
stat_summary(geom = "point", pch=2, fun.y = "mean", size=5)+
theme_basic()+
scale_fill_manual(values=c("#fee0d2", "#de2d26"))+
ylim(c(20, 80))+
xlab("")+
ylab("GC content (%)")+
theme(legend.position="none")+
ggtitle("Viruses")+
theme(plot.title=element_text(size=24, hjust=0.5))
p.bacteria.size <- ggplot(bacteria.unique, aes(Specialist, Genome.size, fill=Specialist))+
geom_violin()+
stat_summary(geom = "point",pch=1, fun.y = "median", size=5)+
stat_summary(geom = "point", pch=2, fun.y = "mean", size=5)+
theme_basic()+
scale_fill_manual(values=c("#f0f0f0", "#636363"))+
scale_y_log10()+
ylab("Log(genome size in Mb)")+
xlab("")+
ylab("Log(genome size in Mb)")+
theme(legend.position="none")
p.virus.size <- ggplot(virus.unique, aes(Specialist, Genome.size, fill=Specialist))+
geom_violin()+
stat_summary(geom = "point",pch=1, fun.y = "median", size=5)+
stat_summary(geom = "point", pch=2, fun.y = "mean", size=5)+
theme_basic()+
scale_y_log10()+
scale_fill_manual(values=c("#fee0d2", "#de2d26"))+
xlab("")+
ylab("Log(genome size in Mb)")+
theme(legend.position="none")
cowplot::plot_grid(p.bacteria.GC, p.virus.GC, p.bacteria.size, p.virus.size, ncol=2, widths=c(1,1))
# Save figure
pdf("figures/Supplementary-Figure-5-GC-content-genome-size.pdf", width=8, height=8)
cowplot::plot_grid(p.bacteria.GC, p.virus.GC, p.bacteria.size, p.virus.size, ncol=2, widths=c(1,1))
dev.off()
## quartz_off_screen
## 2
Effect of genome size of bacteria.
bacteria.specialist.genome.sizes <- bacteria.unique$Genome.size[which(bacteria.unique$PHB==0)]
bacteria.generalist.genome.sizes <- bacteria.unique$Genome.size[which(bacteria.unique$PHB!=0)]
# Missing data - number of associations kept
nrow(pathogen_vs_host_db[which(pathogen_vs_host_db$Species %in% rownames(bacteria.unique)[which(!is.na(bacteria.unique$Genome.size) & !is.na(bacteria.unique$Genome.GC))]),])
## [1] 4619
table(is.na(bacteria.specialist.genome.sizes))
##
## FALSE TRUE
## 827 466
table(is.na(bacteria.generalist.genome.sizes))
##
## FALSE TRUE
## 308 84
wilcox.test(x = na.omit(bacteria.specialist.genome.sizes),
y = na.omit(bacteria.generalist.genome.sizes))
##
## Wilcoxon rank sum test with continuity correction
##
## data: na.omit(bacteria.specialist.genome.sizes) and na.omit(bacteria.generalist.genome.sizes)
## W = 140606, p-value = 0.00698
## alternative hypothesis: true location shift is not equal to 0
Plot of virus PHB for different genome types (RNA/DNA). We use the Baltimore classification (type of genome and method of classification).
options(warn=-1) # Turn off warnings
# Virus genome type
virus.df$genome.type <- ordered(virus.df$genome.type, levels=c("ssRNA-RT", "(-) ssRNA", "(+) ssRNA", "dsRNA", "dsDNA-RT", "dsDNA"))
p.virus<- ggplot(virus.df[!is.na(virus.df$genome.type),], aes(genome.type, virus.phbs, colour=genome.type.rna.dna))+
geom_boxplot(outlier.shape=NA)+geom_jitter(height=0, width=0.25, alpha=0.8)+
theme_basic()+
xlab("")+
ylab("Mean PHB")+
scale_color_manual(values=c("#377eb8", "#4daf4a"))+
labs(colour="Genomic material")+
ylim(c(0,1))+
theme(axis.text=element_text(size=16),
legend.text =element_text(colour="black", size=16),
legend.title=element_text(colour="black", size=18),
strip.text.x=element_text(colour="black", size=22),
axis.title=element_text(colour="black", size=22))
p.virus
# Save figure
pdf("figures/Supplementary-Figure-6-virus-genome-type.pdf", width=12, height=8)
p.virus
dev.off()
## quartz_off_screen
## 2
We plot the effect of bacterial lifestyle on the proportion of specialists vs. generalists.
bacteria.df$gram <- as.character(bacteria.df$gram)
bacteria.df$gram[which(is.na(bacteria.df$gram) | bacteria.df$gram %in% c("na", ""))] <- "Unknown"
# Manually amend lactobacillus spp and
bacteria.df[which(bacteria.df$bacteria.names=="Lactobacillus spp"),"gram"] <- "Gram-positive"
bacteria.df[which(bacteria.df$bacteria.names=="Segniliparus rugosus"),"gram"] <- "Gram-positive"
gram.plot.df <- data.frame(table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)) * 100)
gram.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)),
Var1=rownames(table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)),
Freq=3)
gram.plot <- ggplot(gram.plot.df[which(!gram.plot.df$Var1 %in% c("", "na")),], aes(Var1, Freq, fill=Var2))+
geom_bar(position="stack", stat="identity", colour="black")+
theme_basic()+
scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
guides(fill=FALSE)+
xlab("")+
ylab("Within-category percentage (%)")+
theme(legend.text=element_text(size=14))+
ggtitle("Gram stain")+
theme(plot.title=element_text(size=25, hjust=0.5))+
geom_label(data=gram.plot.N.total.df[which(!gram.plot.df$Var1 %in% c("", "na")),], aes(label=N), fill=NA, size=6)
bacteria.df$motility <- as.character(bacteria.df$motility)
bacteria.df$motility[which(is.na(bacteria.df$motility) | bacteria.df$motility %in% c("na", ""))] <- "Unknown"
motile.plot.df <- data.frame(table(bacteria.df$motility, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$motility, bacteria.df$bacteria.phbs>0))*100)
motile.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$motility, bacteria.df$bacteria.phbs>0)),
Var1=rownames(table(bacteria.df$motility, bacteria.df$bacteria.phbs>0)),
Freq=5)
motile.plot <- ggplot(motile.plot.df[which(!motile.plot.df$Var1 %in% c("", "na")),], aes(Var1, Freq, fill=Var2))+
geom_bar(position="stack", stat="identity", colour="black")+
theme_basic()+
scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
guides(fill=FALSE)+
xlab("")+
ylab("Within-category percentage (%)")+
theme(legend.text=element_text(size=14))+
ggtitle("Motility")+
theme(plot.title=element_text(size=25, hjust=0.5))+
geom_label(data=motile.plot.N.total.df[which(!motile.plot.df$Var1 %in% c("", "na")),], aes(label=N), fill=NA, size=6)
bacteria.df$cell <- as.character(bacteria.df$cell)
bacteria.df$cell[which(is.na(bacteria.df$cell))] <- "Unknown"
cell.plot.df <- data.frame(table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)) * 100)
cell.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)),
Var1=rownames(table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)),
Freq=5)
cell.plot <- ggplot(cell.plot.df[which(!cell.plot.df$Var1 %in% c("", "na")),], aes(Var1, Freq, fill=Var2))+
geom_bar(position="stack", stat="identity", colour="black")+
theme_basic()+
scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
guides(fill=FALSE)+
xlab("")+
ylab("Within-category percentage (%)")+
theme(legend.text=element_text(size=14))+
ggtitle("Cellular lifestyle")+
theme(plot.title=element_text(size=25, hjust=0.5))+
geom_label(data=cell.plot.N.total.df, aes(label=N), fill=NA, size=6)
bacteria.df$oxygen <- as.character(bacteria.df$oxygen)
# Manually correct Lactobacilli
bacteria.df$oxygen[which(bacteria.df$bacteria.names=="Lactobacillus spp")] <- "Facultatively anaerobic"
oxygen.plot.df <- data.frame(table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)) * 100)
oxygen.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)),
Var1=rownames(table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)),
Freq=5)
oxygen.plot <- ggplot(oxygen.plot.df[which(!oxygen.plot.df$Var1 %in% c( "Capnophilic")),], aes(Var1, Freq, fill=Var2))+
geom_bar(position="stack", stat="identity", colour="black")+
theme_basic()+
scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
guides(fill=FALSE)+
xlab("")+
ylab("Within-category percentage (%)")+
theme(legend.text=element_text(size=14))+
ggtitle("Oxygen requirements")+
theme(plot.title=element_text(size=25, hjust=0.5))+
geom_label(data=oxygen.plot.N.total.df[which(!oxygen.plot.N.total.df$Var1 %in% c( "Capnophilic")),], aes(label=N), fill=NA, size=6)
bacteria.df$spore <- as.character(bacteria.df$spore)
bacteria.df$spore[which(bacteria.df$spore %in% c("na", "") | is.na(bacteria.df$spore))] <- "Unknown"
bacteria.df$spore[which(bacteria.df$spore=="Yes")] <- "Spore-forming"
bacteria.df$spore[which(bacteria.df$spore=="No")] <- "Non-spore-forming"
spore.plot.df <- data.frame(table(bacteria.df$spore, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$spore, bacteria.df$bacteria.phbs>0)) * 100)
spore.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$spore, bacteria.df$bacteria.phbs>0)),
Var1=rownames(table(bacteria.df$spore, bacteria.df$bacteria.phbs>0)),
Freq=5)
spore.plot <- ggplot(spore.plot.df[which(!spore.plot.df$Var1 %in% c("", "na")),], aes(Var1, Freq, fill=Var2))+
geom_bar(position="stack", stat="identity", colour="black")+
theme_basic()+
scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
guides(fill=FALSE)+
xlab("")+
ylab("Within-category percentage (%)")+
theme(legend.text=element_text(size=14))+
ggtitle("Spore formation")+
theme(plot.title=element_text(size=25, hjust=0.5))+
geom_label(data=spore.plot.N.total.df, aes(label=N), fill=NA, size=6)
cowplot::plot_grid(cell.plot, motile.plot, oxygen.plot, spore.plot, gram.plot, widths=c(4, 3), ncol=2)
# Save figure
pdf("figures/Supplementary-Figure-7-bacterial-lifestyle.pdf", width=14, height=18)
cowplot::plot_grid(cell.plot, motile.plot, oxygen.plot, spore.plot, gram.plot, widths=c(4, 3), ncol=2)
dev.off()
## quartz_off_screen
## 2
We now consider modelling the effect of motility and cellular behaviour together. No linear models show anything interesting.
# Motility
motility.table <- table(bacteria.df$motility, bacteria.df$bacteria.phbs>0)
motility.table <- motility.table[c("Motile", "Non-motile"),]
motility.table/rowSums(motility.table)
##
## FALSE TRUE
## Motile 0.7276265 0.2723735
## Non-motile 0.7832293 0.2167707
chisq.test(motility.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: motility.table
## X-squared = 5.768, df = 1, p-value = 0.01632
require(knitr)
kable(motility.table)
| FALSE | TRUE | |
|---|---|---|
| Motile | 374 | 140 |
| Non-motile | 878 | 243 |
# Cellular
cellular.table <- table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)
cellular.table <- cellular.table[as.character(unique(bacteria.df$cell))[1:3],]
cellular.table
##
## FALSE TRUE
## Obligate intracellular 25 28
## Facultative intracellular 48 45
## Extracellular 82 79
chisq.test(cellular.table)
##
## Pearson's Chi-squared test
##
## data: cellular.table
## X-squared = 0.2932, df = 2, p-value = 0.8636
kable(cellular.table)
| FALSE | TRUE | |
|---|---|---|
| Obligate intracellular | 25 | 28 |
| Facultative intracellular | 48 | 45 |
| Extracellular | 82 | 79 |
# Oxygen: consider only aerobic and anerobic
oxygen.table <- table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)
oxygen.table <- oxygen.table[1:2,]
chisq.test(oxygen.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: oxygen.table
## X-squared = 15.227, df = 1, p-value = 9.533e-05
kable(oxygen.table)
| FALSE | TRUE | |
|---|---|---|
| Aerobic | 513 | 135 |
| Anaerobic | 307 | 37 |
# And consider facultatively anaerobic
oxygen.table <- table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)
oxygen.table <- oxygen.table[c("Aerobic", "Anaerobic", "Facultatively anaerobic"),]
chisq.test(oxygen.table)
##
## Pearson's Chi-squared test
##
## data: oxygen.table
## X-squared = 55.754, df = 2, p-value = 7.82e-13
# Gram
gram.table <- table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)
gram.table <- gram.table[c("Gram-negative", "Gram-positive", "Gram-variable", "Lack cell wall"),]
kable(gram.table)
| FALSE | TRUE | |
|---|---|---|
| Gram-negative | 666 | 240 |
| Gram-positive | 572 | 134 |
| Gram-variable | 16 | 1 |
| Lack cell wall | 39 | 17 |
Tables in manuscript, including some additional tables.
Table 1 is a review of previous similar studies from the literature, so has no associated code.
Shows the niche and specificity of pathogens. According to our definitions, a specialist pathogen infects only a single host, a generalist more than one. Generalists are categorised according to whether their hosts are within the same family (e.g. Bovidae), order (e.g. Artiodactyla), or across orders. Percentages are shown in the manuscript but are not given here.
options(warn=-1)
# NICHE
unique.pathogen.names <- unique_pathogens$Species
unique.pathogen.types <- unique_pathogens$Type
unique_pathogens$N.non.human.hosts <- as.numeric(sapply(unique.pathogen.names, function(x) nrow(pathogen_vs_host_db_no_humans[which(pathogen_vs_host_db_no_humans$Species==x),])))
unique_pathogens$human.host <- as.numeric(sapply(unique.pathogen.names, function(x) nrow(pathogen_vs_host_db[which(pathogen_vs_host_db$Species==x & pathogen_vs_host_db$HostSpeciesPHB=="Homosapiens"),])))
unique_pathogens$niche <- ifelse(unique_pathogens$human.host==1 & unique_pathogens$N.non.human.hosts==0,
"Human-only",
ifelse(unique_pathogens$human.host==0 & unique_pathogens$N.non.human.hosts!=0,
"Animal-only",
ifelse(unique_pathogens$human.host==1 & unique_pathogens$N.non.human.hosts!=0,
"Zoonotic", "Not classified")))
kable(table(unique_pathogens$niche, unique_pathogens$Type))
| Bacteria | Virus | |
|---|---|---|
| Animal-only | 380 | 540 |
| Human-only | 855 | 133 |
| Zoonotic | 450 | 237 |
# SPECIFICITY
unique_pathogens$specificity <- ifelse(unique_pathogens$N.hosts>1,
"Generalist",
ifelse(unique_pathogens$N.non.human.hosts==1 & unique_pathogens$human.host==0,
"Animal-only specialist",
ifelse(unique_pathogens$N.non.human.hosts==0 & unique_pathogens$human.host==1,
"Human-only specialist", "Not classified")))
all_identical <- function(x) {
if (length(x) == 1L) {
warning("'x' has a length of only 1")
return(TRUE)
} else if (length(x) == 0L) {
warning("'x' has a length of 0")
return(logical(0))
} else {
TF <- vapply(1:(length(x)-1),
function(n) identical(x[[n]], x[[n+1]]),
logical(1))
if (all(TF)) TRUE else FALSE
}
}
within.family <- sapply(unique.pathogen.names, function(x) all_identical(pathogen_vs_host_db[which(pathogen_vs_host_db$Species==x),"HostFamily"]))
within.order <- sapply(unique.pathogen.names, function(x) all_identical(pathogen_vs_host_db[which(pathogen_vs_host_db$Species==x),"HostOrder"]))
within.group <- sapply(unique.pathogen.names, function(x) all_identical(pathogen_vs_host_db[which(pathogen_vs_host_db$Species==x),"HostGroup"]))
unique_pathogens$generalist.narrow.broad <- ifelse(unique_pathogens$N.hosts==1, "Specialist",
ifelse(within.family==TRUE,
"Within-family",
ifelse(within.order==TRUE,
"Within-order",
ifelse(within.family==FALSE & within.order==FALSE & unique_pathogens$N.hosts>1, "Generalist across orders", "Not classified"))))
# SPECIALIST
kable(table(unique_pathogens$specificity, unique_pathogens$Type))
| Bacteria | Virus | |
|---|---|---|
| Animal-only specialist | 231 | 254 |
| Generalist | 599 | 523 |
| Human-only specialist | 855 | 133 |
# GENERALIST NARROW-BROAD
kable(table(unique_pathogens$generalist.narrow.broad, unique_pathogens$Type))
| Bacteria | Virus | |
|---|---|---|
| Generalist across orders | 508 | 307 |
| Specialist | 1086 | 387 |
| Within-family | 49 | 132 |
| Within-order | 42 | 84 |
This is a summary of the best-fit GAMs (see `intermediates/*rds’) for host traits which predict viral/bacterial richness (Figure 4) and pathogen traits which predict zoonotic potential (Figure 5).
Vector-borne pathogens are more likely to be generalists and have a higher median PHB. The below code contains the chi-squared tests mentioned in the manuscript. Supplementary Table 1 is produced from these results, removing those pathogens for which data is not available.
# Virus
table(virus.df$vector.borne)
##
## No Yes Yes?
## 740 167 3
wilcox.test(virus.df$virus.phbs[which(virus.df$vector.borne=="Yes")], virus.df$virus.phbs[which(virus.df$vector.borne=="No")])
##
## Wilcoxon rank sum test with continuity correction
##
## data: virus.df$virus.phbs[which(virus.df$vector.borne == "Yes")] and virus.df$virus.phbs[which(virus.df$vector.borne == "No")]
## W = 90076, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# Vector-borne
quantile(virus.df$virus.phbs[which(virus.df$vector.borne=="Yes")])
## 0% 25% 50% 75% 100%
## 0.0000000 0.0000000 0.4251265 0.5459129 0.8242654
# Not vector-borne
quantile(virus.df$virus.phbs[which(virus.df$vector.borne=="No")])
## 0% 25% 50% 75% 100%
## 0.0000000 0.0000000 0.0000000 0.2748792 1.5404218
# Bacteria
table(bacteria.df$vector.borne)
##
## No Yes Yes?
## 1 1577 105 2
wilcox.test(bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne=="Yes")], bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne=="No")])
##
## Wilcoxon rank sum test with continuity correction
##
## data: bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne == "Yes")] and bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne == "No")]
## W = 105488, p-value = 1.828e-10
## alternative hypothesis: true location shift is not equal to 0
# Vector-borne
quantile(bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne=="Yes")])
## 0% 25% 50% 75% 100%
## 0.0000000 0.0000000 0.0000000 0.5065082 0.7668872
# Not vector-borne
quantile(bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne=="No")])
## 0% 25% 50% 75% 100%
## 0.0000000 0.0000000 0.0000000 0.0000000 0.9109023
Consider generalist/specialist proportion.
# Virus
virus.vector.table <- table(virus.df$vector.borne, virus.df$virus.phbs>0)
virus.vector.table <- virus.vector.table[c("No", "Yes"),]
chisq.test(virus.vector.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: virus.vector.table
## X-squared = 60.878, df = 1, p-value = 6.073e-15
# Bacteria
bacteria.vector.table <- table(bacteria.df$vector.borne, bacteria.df$bacteria.phbs>0)
bacteria.vector.table <- bacteria.vector.table[c("No", "Yes"),]
chisq.test(bacteria.vector.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: bacteria.vector.table
## X-squared = 42.322, df = 1, p-value = 7.74e-11
Viruses with an RNA genome and larger genome size have a greater host range. Having an RNA genome and a larger genome were both significantly associated with greater mean PHB (p<0.001 for both variables) with a non-significant interaction between them (p=0.36).
# Interaction model
virus.df.model <- virus.df[which(virus.df$genome.size!=""),]
# Interaction
combined.model <- lm(virus.phbs ~ genome.type.rna.dna*genome.size, data=virus.df.model)
summary(combined.model)
##
## Call:
## lm(formula = virus.phbs ~ genome.type.rna.dna * genome.size,
## data = virus.df.model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31203 -0.15312 -0.05045 0.15669 0.66023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04796 0.01529 3.136 0.00179 **
## genome.type.rna.dnaRNA 0.18795 0.03073 6.116 1.65e-09 ***
## genome.size 0.66078 0.14753 4.479 8.85e-06 ***
## genome.type.rna.dnaRNA:genome.size 1.74152 1.91937 0.907 0.36456
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2055 on 655 degrees of freedom
## Multiple R-squared: 0.1771, Adjusted R-squared: 0.1733
## F-statistic: 46.99 on 3 and 655 DF, p-value: < 2.2e-16
The following table is not shown in the main text, but it is interesting to note that genome size is not significantly associated with greater PHB in a univariate model.
# Interaction
univariate.size.model <- lm(virus.phbs ~ genome.size, data=virus.df.model)
summary(univariate.size.model)
##
## Call:
## lm(formula = virus.phbs ~ genome.size, data = virus.df.model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1901 -0.1897 -0.1763 0.1950 0.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.19026 0.01027 18.530 <2e-16 ***
## genome.size -0.07526 0.14621 -0.515 0.607
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2261 on 657 degrees of freedom
## Multiple R-squared: 0.000403, Adjusted R-squared: -0.001118
## F-statistic: 0.2649 on 1 and 657 DF, p-value: 0.6069
Bacterial motility and cellular lifestyle are not associated with greater host range. Combining motility and cellular proliferation in a linear model with suggests that neither variable is associated with greater mean PHB.
bacteria.df.model <- bacteria.df[which(bacteria.df$cell %in% c("Extracellular", "Facultative intracellular", "Obligate intracellular") &
bacteria.df$motility %in% c("Motile", "Non-motile")),]
model <- lm(bacteria.phbs ~ cell + motility, data=bacteria.df.model)
summary(model)
##
## Call:
## lm(formula = bacteria.phbs ~ cell + motility, data = bacteria.df.model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2886 -0.2635 -0.1811 0.2568 0.6223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26354 0.02840 9.278 <2e-16 ***
## cellFacultative intracellular -0.04182 0.03845 -1.088 0.278
## cellObligate intracellular -0.00428 0.05110 -0.084 0.933
## motilityNon-motile 0.02502 0.03726 0.672 0.502
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2929 on 299 degrees of freedom
## Multiple R-squared: 0.006106, Adjusted R-squared: -0.003866
## F-statistic: 0.6124 on 3 and 299 DF, p-value: 0.6075
Again, univariate linear models and a linear model with an interaction term give the same conclusion.
Cell lifestyle
cell.model <- lm(bacteria.phbs ~ cell, data=bacteria.df.model)
summary(cell.model)
##
## Call:
## lm(formula = bacteria.phbs ~ cell, data = bacteria.df.model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2843 -0.2744 -0.1669 0.2500 0.6365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.274376 0.023353 11.749 <2e-16 ***
## cellFacultative intracellular -0.039739 0.038288 -1.038 0.300
## cellObligate intracellular 0.009906 0.046484 0.213 0.831
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2926 on 300 degrees of freedom
## Multiple R-squared: 0.004607, Adjusted R-squared: -0.002029
## F-statistic: 0.6943 on 2 and 300 DF, p-value: 0.5002
Motility
motility.model <- lm(bacteria.phbs ~ motility, data=bacteria.df.model)
summary(motility.model)
##
## Call:
## lm(formula = bacteria.phbs ~ motility, data = bacteria.df.model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2753 -0.2753 -0.1907 0.2551 0.6356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24949 0.02527 9.873 <2e-16 ***
## motilityNon-motile 0.02585 0.03384 0.764 0.445
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2925 on 301 degrees of freedom
## Multiple R-squared: 0.001935, Adjusted R-squared: -0.001381
## F-statistic: 0.5836 on 1 and 301 DF, p-value: 0.4455
Combined model with interaction term
interaction.model <- lm(bacteria.phbs ~ cell*motility, data=bacteria.df.model)
summary(interaction.model)
##
## Call:
## lm(formula = bacteria.phbs ~ cell * motility, data = bacteria.df.model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2937 -0.2596 -0.1862 0.2530 0.6172
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 0.259634 0.031092 8.351
## cellFacultative intracellular -0.030196 0.053653 -0.563
## cellObligate intracellular -0.009388 0.053745 -0.175
## motilityNon-motile 0.034035 0.047243 0.720
## cellFacultative intracellular:motilityNon-motile -0.023963 0.077047 -0.311
## cellObligate intracellular:motilityNon-motile NA NA NA
## Pr(>|t|)
## (Intercept) 2.59e-15 ***
## cellFacultative intracellular 0.574
## cellObligate intracellular 0.861
## motilityNon-motile 0.472
## cellFacultative intracellular:motilityNon-motile 0.756
## cellObligate intracellular:motilityNon-motile NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2933 on 298 degrees of freedom
## Multiple R-squared: 0.006429, Adjusted R-squared: -0.006908
## F-statistic: 0.4821 on 4 and 298 DF, p-value: 0.7489